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The Projection Congruent Subset (PCS) Outlyingness is a new 
index of multivariate outlyingness obtained by considering univariate 
projections of the data. Like many other outlier detection procedures, 
PCS searches for a subset which minimizes a criterion. The difference 
is that the new criterion was designed to be insensitive to the outliers. 
PCS is supported by FastPCS, a fast and affine equivariant algorithm 
which we also detail. Both an extensive simulation study and a real 
data application from the field of engineering show that FastPCS 
performs better than its competitors. 

1. Introduction. Outliers are observations that do not follow the pat- 
tern of the majority of the data (Rousseeuw and van Zomeren (1990)). Out- 
lier identification is a major part of data analysis for at least two reasons. 
First, because a few outliers, if left unchecked, will exert a disproportionate 
pull on estimated parameters, and we generally do not want our inferences 
to depend on such observations. In addition, we may want to find outliers to 
study them as objects of interest in their own right. In any case, detecting 
outliers in more than two dimensions is difficult because we can not inspect 
the data visually and have to rely on algorithms instead. 

Formally, this paper concerns itself with the simplest, most basic variant 
of the multivariate outlier detection problem. The general setting is that of 
a sample of n observations Xi £ TV (with n > p), at least h = [(n-|-p-|- 1)/2] 
of which are drawn from a model with continuous support. The objective 
is to identify reliably the index of the remaining ones. A more complete 
treatment of this topic can be found in the book by Maronna, Martin and 
Yohai (2006). 

In this article we introduce PCS, a new procedure for finding multivariate 
outliers. We also detail FastPCS, a fast algorithm for computing it. The 
main output of FastPCS is an outlyingness index measuring how much each 
observation departs from the pattern set by the majority of the data. The 
PCS outlyingness index is affine equivariant (meaning that the outlyingness 
ranking of the observations is not affected by a linear transformation of the 
data) and can be computed efficiently for moderate values of p and large 
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values of n. 

To derive this index, FastPCS proceeds in two steps. First, it strives to 

select among many possible h subsets of observations one devoid of outliers. 
Then, the outlyingness index is simply the distance of each observation to 
this subset. 

For easier outlier detection problems, we find that our approach produces 

results similar to state of the art outlier detection algorithms. When consid- 
ering more difficult cases however we find that the solution we propose leads 
to significantly better outcomes. 

In the next section we motivate and define the PCS outlyingness and 
FastPCS. Then, in Section 3 we compare FastPCS to several competitors on 
synthetic data. In Section 4 we conduct a real data comparison. In Section 
5, we offer closing thoughts and discuss directions for further research. 

2. The PCS outlyingness. 

2.1. Motivation. Throughout this note, we will denote as Hm a subset 
of the indexes {1, 2, . . . , n}, of size h. For any we denote its mean and 
covariance matrix as 



and we will write the squared Mahalanobis distance of an observation Xi as 



FastPCS starts by selecting among many such subsets an /i-subset of obser- 
vation that minimizes a criterion. Other affine equivariant outlier detection 
algorithms that proceed in this way are the FastMVE (Maronna, Martin and 
Yohai 2006), the FastMCD (Rousseeuw and Van Driessen 1999) and the SDK 
(Stahel 1981). For all four procedures we will denote the chosen /i-subset as 



Each of these algorithms relics on a different criterion to select the optimal 
/i-subset. In all cases, there is a straightforward relationship between H^, and 
the resulting outlyingness index. For example, for FastMVE and FastMCD 
the outlyingness index is the vector of distances dj^ ^{t^, S^). For SDE the 
outlyingness index is the vector of values of PM{xi) 



(2.1) 




(2.2) 




(2.3) 



Pnixi) = sup 



x'^am - med^=i(x^a,„)| 



mad"=i(a;^a^) 
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where is a set of Mp directions orthogonal to hyperplanes spanned by a 
p-subset of {1 : n} and H^, contains the observations with smallest values of 

In all cases, if H^, itself is contaminated, then the resulting outlyingness 
index can no longer be depended upon to find the outliers. For FastMVE 
and FastMCD, i/* is the subset of h observations with smallest covariance 
matrix determinant and enclosed by the ellipsoid with smallest volume, re- 
spectively. Given enough outliers, an adversary can easily confound these two 
characterizations (for example, by placing outliers on a subspace of R^) and 
ensure that a contaminated /i-subset will always be chosen over /i-subsets 
formed of genuine observations. 

It is not always recognized that the SDE outlyingness index is also sen- 
sitive to the outliers. Given enough outliers, they can always be placed in 
such a way that the denominator in equation (2.3) will be smaller for some 
direction a* along which the numerator is smaller for outliers than for gen- 
uine observations (for example, consider a direction a* close to the principal 
axis of the orange ellipse in the lower-left panel in Figure 1). Then, for many 
observations the maximum outlyingness will be attained at am = «*■ Re- 
peated over a large number of such directions, this will cause the value of 
the outlyingness index in equation (2.3) to be smaller for many outliers than 
for genuine observations. 

Consider the following example. The four panels in Figure 1 all depict 
the same 70 draws from a standard bivariate normal distribution together 
with 30 draws from a second normal distribution with smaller variance and 
centered at (5,-1). Orange ellipses in the first three plots show the (t*, S**) 
obtained using the best subset found by the FastMCD, FastMVE and SDE. 
These were computed with the R package rrcov (Todorov and Filzmoser 
2009), using default settings; 500 starting p-subsets (for the first two), and 
500 directions Um (for SDE); and 50 percent BDP in all cases. 

In all three cases, the fitted ellipses (drawn in solid orange lines), fail to 
adequately fit any /i-subset of the data, including the subset they enclose. 
In particular, their centers of symmetry (orange stars) are not located in 
areas of highly concentrated observations. In all cases, the model fitted on 
i?* appears visually distinct from the distribution governing the good part 
of the data (drawn as a dashed blue ellipse). This is confirmed by the biases 
(a dimensionless measure of dissimilarity between two subsets, see Section 3) 
of 7, 5 and 3.5 for the three algorithms. 

The algorithm we propose differs from these methods in that it uses a new 
measure of multivariate congruence (which we detail in the next section) to 
select the optimal /i-subset. The main advantage of this new characterization 
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Fig 1. The four plots depict (left right, top, bottom) the same configuration of observations. 
The blue dashed ellipses show the contours of the model governing the distribution of the 
majority -here 70 out of 100- of the observations. The orange ellipses show, respectively, 
the FastMCD, FastMVE, SDE and FastPCS estimates of scatter and location. 



lies in it insensitivity to the outliers. As we argue below, this makes the out- 
lyingness index derived from FastPCS both quantitatively and qualitatively 
more reliable. 

2.2. Construction of the PCS Outlyingness. PCS looks for the /i-subset 
of multivariate observations that is most congruent along many univariate 
projections. In this context, we measure the congruence of a given /i-subset 
along a given projection by the size of it overlap with a second subset that 
is optimal (in a sense we make precise below) on that projection. 

In essence, for a given initial /i-subset and a random univariate projection 
we find a second /i-subset that minimizes a criterion on that projection and 
measure the size of the overlap of the two subsets. The PCS criterion is 
based on the observation that a spatially cohesive /i-subset will tend to be 
congruent with these optimal /i-subsets, and a spatially disjoint one will not. 

More precisely, denoting Amk a p x p matrix of observations (we detail 
below how we pick these p observations) and Omfc = ^mfc^- arbitrary 
p- vector b ^ 0, dp^i will denote the orthogonal distances of the XiS to amk'- 

(2.4) dpi{amk) = {xiUmk - hf 

The set of h observations with smallest dpj^{amk) will be denoted as Hmk- 
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For a given subset Hm and direction Umk we define the incongruence index 
of Hm along Umk as 

.ave dpi{amk) 

(2.5) liHm, amk) ■= log ,2 , r • 

ave dpi[amk) 

Tliis index is always positive and will have small value if the projection of 
the members of Hm along amk greatly overlap with the members of Hmk- To 
remove the dependence of equation (2.5) on amk we measure the incongruence 
of Hm by considering the average over many directions: 

(2.6) I{Hm) ■■= ave I{Hm,amk) 

eB{Hm) 

where B{Hm) are all directions orthogonal to a hyperplane spanned by a ^ 
subset of Hm- We call the Hm with smallest I{Hm) the projection congruent 
subset. In essence, The / index measures the incongruence of an /i-subset Hm 
in terms of how much the projection of its members overlap with those of the 
Hmk over many projections. In practice, it would be infeasible to evaluate 
equation (2.6) over all members of B{Hm)- A practical solution is to take 
the average over a sample of K random directions Bx{Hm)- 

The / index of a spatially disjoint /i-subset tends to be higher than that 
of a spatially cohesive /i-subset. This is because when Hm forms a spatially 
cohesive set of observations, \{Hm U Hmk}\ tends to be larger. 

This is illustrated on an example in Figure 2. Both panels depict the same 
set of n = 100 points. These points form two disjoint groups of observations. 
The main group contains 70 points and is located on the left hand side. Each 
panel illustrates the behavior of the / index for a given /i-subset of observa- 
tions. Hi (left) forms a set of spatially cohesive observations, all belonging 
to the main group. H2, in contrast, forms a spatially disjoint set of observa- 
tions and contains members drawn from both groups. For each //^-subset, 
?Ti = {1, 2}, we drew two hyperplanes dmi 

(blue, dashed) and 0^2 (orange). 
The blue dots show the members of {Hm^ Hmi} ■ Similarly, the cockade dots 
(orange inside, blue outside) show the members of {Hm U Hm2}- The black 
diamonds show the members of Hmi {Hm2) that do not belong to Hm- Af- 
ter just two projections, the number of non-overlapping observations (i.e.{ 
{Hm \ Hmi} n {Hm \ Hm2}}) is 8 (m = 1) and 18 (m = 2) respectively. 
As we increase the number of directions Umki this pattern repeats and the 
difference between a spatially cohesive and a disjoint subset grows steadily. 

The / index measures the size of this overlap. For a direction a^fc, the 
members of Hmi and Hm2 not in Hm (shown as black square and diamonds 
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Fig 2. Incongruence index for a subset Hi of spatially cohesive observations (left) and a 
subset H2 of spatially disjoint observation (right). 



in Figure 2) will decrease the denominator in equation (2.5) without affecting 
the numerator, increasing the overall ratio. Consequently, /i-subsets formed 
of spatially disjoint groups of observations will have larger values of the / 
index. Crucially, the I index characterizes a cohesive /i-subset of observations 
independently of the spatial configuration of the outliers. For example, the 
pattern shown in Figure 2 would still hold if the cluster of outliers were 
more concentrated. This is also illustrated in the fourth quadrant of Figure 
1 where the optimal /i-subset found by FastPCS is not unduly attracted by 
members of the smaller cluster of observations located on the right. 

In Sections 3 and 4, we show that this new characterization allows FastPCS 
to reliably select uncontaminated /i-subsets. This includes many situations 
where other algorithms fail to do so. First though, the following section 
details the FastPCS algorithm. 

2.3. A Fast Algorithm for the PCS Outlyingness. To compute the PCS 
outlyingness index, we propose the FastPCS algorithm (2.7). It combines 
ideas from FastMCD (the use of random p-subsets as starting points and an 
iterative concentration step) with some new ones. Throughout, Mp denotes 
the number of starting p subsets. 

Our algorithm can detect exact fit situations (when h or more observations 
lie on an hyperplane). When at least h points lie in an exact fit, FastPCS will 
return the index of those h observations. In real datasets, exact fits are rare. 
Nevertheless, handling them is important for theoretical reasons (Rousseeuw 
1994). 

The second branch of the algorithm starts at h. Step b grows the size of 
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Hfn from ^0 to its final size h = \{n + p + l)/2\ m. L steps, rather than in 
one as is done in FastMCD. This improves the robustness of the algorithm 

when outliers are close to the good data. We find that increasing L does not 
improve performance much if L is greater than 3 and use L = 3 as default. 

(2.7) Algorithm FastPCS 

for m = 1 to Mp do: 

a: Hjn {p+ 1 observations drawn at random from 1 : n} 
ho ■<— the smallest integer for which {^°) > K 

set Hjy^ ■<— |i : d'^^(tm,Sra) ^ '^M,(/io)(^™' '^'")} 
b: for Z = 1 to L do: 

X C^pj(o^mfc) 

Di{Hm) ^ ave '-^ — for alH = 1, . . . , n 

fe=i ave dp Aajnk) 

set hi-k- \{n-p - 1)Z/(2L)] + p + 1 

set Hm |i : Di{Hm) < D(/i;)(i?m)| ('concentration step') 
end for 

K 

c: compute I{Hm) ^ aveI{Hm,amk) 

k=l 

end for 

Keep the subset Hm with lowest I{Hm)- The outlyingness index 
of each observation is given by Di{H^). 

Empirically also, we found that small values for K are sufficient to achieve 

good results and that we do not gain much by increasing K above 25, so we 
set = 25 as the default. That such a small number of random directions 
suffice to reliably identify the outliers is remarkable. This is because FastPCS 
uses directions generated by p-subsets of iJ^. 

Compare this to SDE algorithm which needs a much larger number of 
projections to reliably find the outliers. This is because in SDE the data is 
projected along directions given by hyperplanes passing by p points drawn 
indiscriminately from the entire set of observations. Consequently, when the 
contamination rate is high, most of these p-subscts will be contaminated, 
yielding directions that can end up almost parallel to each other. In con- 
trast, our choice always ensures a wider spread of directions when Hm is 
uncontaminated and thus yields better results. 

Like FastMCD and FastMVE, FastPCS uses many random p-subsets as 
starting points. The number of initial p-subsets, Mp, must be large enough 
to ensure that at least one of them is uncontaminated. For each starting 
p-subset, the time complexity of these algorithms scales as 0{p^ + np'^). The 
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random directions used in the SDE are orthogonal to hyperplanes passing 
by p points. Here also, the number of such hyperplanes must be of order Mp 
to ensure that at least one of them does not pass by any of the outliers. 

The value of Mp grows exponentially with p. In practice this means all 
four procedures become impractical for values of p much larger than 25. All 
four procedures belong to the class of so called 'embarrassingly parallel' algo- 
rithms, i.e. their time complexity scales as the inverse of the number of pro- 
cessors meaning that they are particularly well suited to benefit from mod- 
ern computing environments. To enhance user experience, we implemented 
FastPCS in C++code wrapped in a portable R package (R Development Core 
Team, 2012). 

3. Empirical Comparison: Simulation Study. In this section we 
evaluate FastPCS numerically and contrast its performance to that of the 
SDE, FastMCD and FastMVE algorithms. The last three were computed 
using the R package rrcov with default settings (except, respectively, for the 
number of random directions and starting subsets which for all algorithms 
we set according to equation (3.5)). Each algorithm returns a subset H^, of 
h observations with smallest outlyingness index. Our evaluation criteria are 
the bias and the misclassification rate of these h observations. Below, we 
briefly describe these. 

3.1. Asymptotic Bias. Given a central model an arbitrary distribution 
consider a contamination model 



For a subset of observations H^,, the (asymptotic) bias measures the deviation 



where Ai (Ap) are the largest (smallest) eigenvalues of G^^^'^TG^^^'^ . Eval- 
uating the bias of (t*,^*) is an empirical matter. For a given sample, the 
bias will depend on the dimensionality of the data, the rate of contamina- 
tion and the distance separating the outliers from the good part of the data. 
Finally, the bias also depends on the spatial configuration of the outliers 




-^e = (1 - s)Tuiflu, ^u) + e-7^c(/ic, ^c) • 






THE PCS OUTLYINGNESS INDEX 



9 



(the choice of Tc)- Fortunately, for aflfine equivariant algorithms the outlier 
configurations causing the largest biases are known and so we can focus on 
these cases. 

3.2. Miss- classification rate. We can also compare the outlyingness in- 
dexes in terms of rate of contamination of a subset H. Denoting 1^ the index 
set of the contaminated observations and X(.) the indicator function, the 
misclassification rate is: 



This measure is always in [0, 1], thus yielding results that are easier to com- 
pare across dimensions and rates of contamination. A value of 1 means that 
the H contains all the outliers. The main difference with the bias criterion 
is that the misclassification rate does not account for how disruptive the 
outliers are. For example, when the distance, is small, it is possible for 
Miss.Rate(/c, if*) to be large without a commensurate increase in bias(5*). 
For FastPCS and SDE, using Miss.Rate(/c, -ff*) also has the advantage that 
this criterion does not depend on the auxiliary estimate S"* since it is based 
on the outlyingness index directly. 

3.3. Outlier configurations. We generate many contaminated datasets 
of size n with = Xu U Xc where Xu and Xc are, respectively, the genuine 
and outlying part of the sample. The bias depends on the distance between 
the outliers and the genuine observations which we will measure by 



The bias also depends on the spatial configuration of Xc- For affine equiv- 
ariant algorithms, the worst-case configurations (those causing the largest 
bias) are known. In increasing order of difficulty these are: 

• Shift configuration. If we constrain the adversary to (a) |i7c| > \Eu\ 
and (b) place Xc at a distance v of Xu, then, to maximize the bias, 
the adversary will set Uc = (Theorem 1 in Rocke and Woodruff 
1996) and set /Jc in order to satisfy (b). Intuitively, this makes the 
components of the mixture the least distinguishable from one another. 

• Point contamination. If we omit the constraint (a) above but keep (b), 
the adversary will place Xc in a single point so \Uc\ = (Theorem 2 
in Rocke and Woodruff 1996). 



(3.3) 




(3.4) 
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• If we omit both constraints (a) and (b) , the adversary may set fic = fJ-u 
and choose Ec to obtain a large bias. The barrow wheel contamination 
(Stahel and Maechler 2009) does this. 



Fig 3. The three outlier configurations (Cluster, Point-Mass and Barrow Wheel). The 
outliers are depicted as orange dots. 

We also considered other configurations such as radial outliers, but they 
posed little challenge for any of the algorithms, so these results are not shown. 

3.4. Simulation parameters. We can generate both the uncontaminated 
data Xu and the contaminated data Xc from the standard normal distri- 
bution since all methods under consideration are affine equivariant. For the 
shift and point configurations, we will also generate data from the standard 
Cauchy distribution in order to quantify the sensitivity of each method to 
the tail index of the data. 

For the shift and point configurations we generated the outliers by set- 
ting Sc as either Dp or 10~^Dp {Dp denotes a rank p diagonal matrix) and 
/ic = vd ill which d is the eigenvector of Su corresponding to its smallest 
eigenvalue and r] is chosen to satisfy equation (3.4). We generated the bar- 
row wheel configuration using the robustX package (Stahel and Maechler 
2009) with default parameters. The three configuration types are depicted 
in Figure 3 for n = 100, p = 2, e = 0.4, = 2 and J^u is the bivariate 
normal. The outlying observations are the orange triangles. The complete 
list of simulation parameters follows: 

• the dimension p is one of {4, 8, 12, 16}, 

• the sample size is n = 25p, 

• the contamination fraction e is one of {0.1, 0.2, 0.3, 0.4}, 

• the configuration of the outliers is either shift, point, or barrow wheel, 

• for shift and point contamination, the distance u comes from the uni- 
form distribution on (0,10). The barrow wheel contamination does not 
depend on v. 
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• the number of initial {p + l)-subsets Mp is given by 
(35) M - 

with eo = 0.4 so that the probabihty of getting at least one uncontam- 
inated subset is always at least 99 percent. 

In Figures 4 to 10 we display the bias and the misclassification rate (left 
and right plots, respectively) for discrete combinations of the dimension p, 
and contamination rate e. In all cases, we expect the outlier detection prob- 
lem to become monotonically harder as we increase p and e, so not much 
information will be lost by considering a discrete grid of a few values for 
these parameters. For the barrow wheel, p and e are the only parameters 
we have and so we can chart the results as trellises of boxplots (Deepayan 
2008). 

For the shift and point contamination models, the configurations also de- 
pend on the distance separating the data from the outliers. Here, the effects 
of V on the bias are harder to foresee: clearly nearby outliers will be harder to 
detect but misclassifying distant outliers will increase the bias more. There- 
fore, we will test the algorithms for many values (and chart the results as a 
function) of v. For both the bias and the misclassification rates curves, for 
each algorithm, a solid colored line will depict the median and a dotted line 
(of the same color) the 75th percentile. Here, each panel will be based on 
1000 simulations. 

3.5. Simulation results (a). The first part of the simulation study covers 
the case where there is no information about the extent to which the data 
is contaminated. Then, we have to set the size of the subset of observations 
having positive weight to /i, corresponding to the lower bound of slightly 
more than the majority of the observations. 

In Figure 4 we display the bias (bias (5*)) and the misclassification rate 
(Miss.Rate(/c, -f^*)) curves of each algorithm as a function of v for different 
values of p and e for the shift configuration when is the standard normal. 
All the methods perform well until e = 0.3 and p < 8. For larger values 
of p and I/, the bias curves and misclassification rate of FastMVE and SDE 
clearly stand out for values of u between 2 and 6. Eventually, as we move 
the outliers further away, SDE and FastMVE can again find them reliably. 

When is Cauchy (Figure 5), FastMVE performs significantly worse 
than the other algorithms starting already at e = 0.2 and p > 12. Eventually, 
(e = 0.3 and p > 12) FastMCD also breaks away from E^- SDE performs 
better until e = 0.4, where it is also noticeably affected by the thicker tails of 



12 K. VAKILI AND E. SCHMITT 

2468 2468 2468 2468 





1 1 1 1 

p=4 


1 1 1 1 
p=8 


1 1 1 1 

p=12 


1 1 1 1 
p=16 






1 1 1 1 

p=4 


1 1 1 1 
p=8 


1 1 1 1 

p=12 


1 1 1 1 
p=16 




o 

w 










- 6 

- 2 


o 










- 0.8 

- 0.2 


d 
II 
Lg 










- 6 

- 2 


CM 
O 








\ 


- 0.8 

- 0.2 


CO 

d 










- 6 

- 2 


O 








\\ 


- 0.8 

- 0.2 


d 
II 










- 6 

- 2 


o 








P., 


- 0.8 

- 0.2 



Fig 4. Bias (left) and misclassification rate (right) due to shift contamination for e — 
{0.1, . . . , 0.4}, p = {4, . . . , 16} and Normal-distributed observations shown as a function 
of V. FastMCD , FastMVE , SDK , FastPCS . 



the Cauchy distribution and fails to identify the outhers. FastPCS maintains 
constant and low bias (and misclassification rates) throughout. 
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Fig 5. Bias (left) and misclassification rate (right) due to shift contamination for e — 
{0.1, . . . , 0.4}, p = {4, . . . , 16} and Cauchy-distributed observations shown as a function 
ofu. FastMCD , FastMVE , SDE , FastPCS . 

In Figure 6, we show the results for the more difficult case of point-mass 
contamination when Tu is the standard normal. Starting at e = 0.2, the bias 
curves and misclassification rate of FastMCD become very high (except for 
p = 4). Starting at (e = 0.3 and p > 8) all the algorithms except FastPCS 
fail to reliably find the outliers: looking at the misclassification rate, the 
optimal /i-subsets for SDE, FastMVE and FastMCD even contain a higher 
contamination rate than e. 

The Cauchy case shown in Figure 7 is also interesting. It suggests, again. 
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that FastMCD and FastMVE are very sensitive to fat tails in the distribution 
of the good part of the data. The behavior of SDE is roughly in line with 
Figure 6. Again, we see that FastPCS is the only algorithm that can reliably 
find the outliers. Furthermore, the misclassification rates reveal that the H^, 
found by SDE, FastMCD and FastMVE often contains a higher proportion 
of outliers than the original dataset. 
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Fig 7. Bias (left) and misclassification rate (right) due to point-mass contamination for 
e — {0.1, . . . , 0.4}, p = {4, . . . , 16} and Cauchy-distributed observations shown as a func- 
tion of V. FastMCD , FastMVE , SDE , FastPCS . 

In Figure 8 we show the bias curves for the barrow wheel as boxplots. 
FastMCD performs badly compared to the other methods, and FastMVE 
deteriorates for e larger than 30 percent. It is at e = 0.4 that FastPCS 
demonstrably exhibits superior performance, with SDE producing results in 
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between the other three algorithms. 
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Fig 8. Bias (left) and misclassification rate (right) due to the barrow wheel contamination 
for e = {0.1, . . . , 0.4} and p = {A, . . . , 16}. FastMCD , FastMVE , SDE, FastPCS . 



Across the tests, FastPCS consistently maintains low and stable biases and 
misclassification rates. Furthermore, the performance of FastPCS is relatively 
insensitive to whether v, e and p are low or high. Additionally, in contrast 
with the other algorithms, FastPCS is unaffected by the thickness of the 
tails of Tu- The consistent ability of FastPCS to detect outliers is a highly 
desirable feature, which we characterize as a form of qualitative robustness. 

3.6. Simulation results (h). In this section, we consider the case, impor- 
tant in practice, where the user can confidently place a upper bound on the 
rate of contamination of the sample. To fix ideas, we will set q, or the num- 
ber of observations assumed to follow a model, to approximately q = 3n/4 
(to avoid confusion we will always use h to indicate subsets of approximately 
n/2 observations). For FastPCS, FastMVE and FastMCD we adapt the algo- 
rithms by setting their respective a parameter to 0.75. SDE does not have a 
corresponding parameter, so we take as member of Q^, the (approximately) 
3n/4 observations with smallest outlyingness. Furthermore, we reduce the 
number of starting subsets (FastPCS, FastMCD, FastMVE) and random di- 
rections (SDE), by using equation (3.5), but this time setting eq = 0.2. Then, 
as before, we measure the effect of our various configurations of outliers on 
the estimators (but this time only considering values of e S {0.1,0.2}). 

In Figures 9 and 10, we show the simulation results for shift and point- 
mass contamination. The results for Normal-distributed observations are 
shown in the first two rows, while the last two rows show the results for 
observations drawn from the multivariate Cauchy distribution. The shift 
contamination results in Figure 9 show that when we increase the size of 
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the active subsets, FastPCS still maintains a low bias and misclassification 
rate, but at equivalent rates of contamination, the other algorithms exhibit 
noticeably weaker performance than in Figures 4 and 5. Results in Figure 10 
illustrate that under point-mass contamination, FastPCS again reports good 
results. The other algorithms continue to exhibit larger biases and misclassi- 
fication rates (at equivalent rates of contamination) than in Figures 6 and 7. 
We do not show the results for the barrow wheel because all the algorithms 
perform equivalently. 
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Fig 9. Bias (left) and misclassification rate (right) due to the shift contamination for 
q ~ 3n/4, e — {0.1, 0.2} and p — {4, . . . , 16}. The first (last) two rows are for multivariate 
normal (Cauchy) distributed r.v. FastMCD , FastMVE , SDE, FastPCS . 
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Fig 10. Bias (left) and misclassification rate (right) due to point-mass contamination for 
q ~ 3n/4, e = {0.1, 0.2} and p — {4, . . . , 16}. The first (last) two rows are for multivariate 
normal (Cauchy) distributed r.v. FastMCD , FastMVE , SDE , FastPCS . 

3.7. Small Sample Accuracy. For affine equivariant outlier detection al- 
gorithms there is a trade-off between bias and accuracy (Rousseeuw 1994). 
This is especially problematic in small n and low e settings because then any 
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reduction in bias can be partially undone by losses in small sample accuracy. 
For FastPCS in particular, this is visible in the first rows of Figures 4 and 8: 
in the misclassification plots, we see that for all algorithms the optimal sub- 
sets H^: are uncontaminated by outliers. Yet, the corresponding bias curves 
for FastPCS tend to be somewhat higher than for the other algorithms. 

To improve accuracy in small samples without conceding too much in bias, 
a solution is to add a so-called re-weighting step to the outlier detection 
algorithm. In essence, we replace H^, by a larger subset J+, itself derived 
from H^:. The motivation is that, typically, J_|_ will include a greater share 
of the uncontaminated observations. Over the years, many such re-weighting 
procedures have been proposed (Maronna, Martin and Yohai 2006). Perhaps 
the simplest is the so called hard-thresholding re- weighting. Given an optimal 
subset H^:, the members of J_|_ are: 



Contrary to i/*, the size of J+ is not fixed in advance. This makes it dif- 
ficult to compare algorithms on the basis of bias(S'+) or Miss.Rate(Ic, J+) 
evaluated at contaminated sub-samples X^. Nonetheless, we can still com- 
pare them in terms of the bias of computed at uncontaminated data sets 
Xu and for various sample sizes n. In essence, we measure the finite sample 
accuracy of the various algorithms by considering their biases at uncontami- 
nated datasets as a function of n. In Figure 11, we show the results of doing 
this for n = {100,200, ...,500,599} (shown as the abscissa) and p = 8. We 
set 599 observations as an upper limit because when n > 600 FastMCD 
uses a nested sub-sampling scheme whereby larger datasets are divided unto 
non-overlapping sub-samples of at most 600 observations. 

Figure 11 depicts the median (solid) and 75th percentile (dotted) accuracy 
curves of the algorithms for increasing n when is normally distributed 
(left) and Cauchy distributed (right). Each panel is based on 1,000 experi- 
ments. As expected, FastPCS is noticeably less accurate than the other al- 
gorithms in the Gaussian case. Nonetheless, we maintain that this difference 
is small compared to the reduction in bias achieved by FastPCS on con- 
taminated samples. Furthermore, this gap in performance depends on the 
distribution of the good data: when J^u is Cauchy the accuracy of FastPCS 
is similar to that of the other algorithms. 

4. Empirical Comparison: Case Study. In this section, we illustrate 
the behavior of FastPCS on a real data problem from the field of engineer- 
ing: the Concrete Slump Test Data set (Yeh 2007). This dataset includes 
103 data-points, each corresponding to a different type of concrete. For each 
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Fig 1 1 . Empirical bias at uncontaminated samples for p = 8 as a function of sample size. 
FastMCD , FastMVE , SDE, FastPCS . 

observation we have 7 so-called input variables measuring, respectively, the 
quantity of cement (kg/m3), fly ash (kg/m3), blast furnace slag (kg/m3), 
water (kg/m3), super-plasticizer (kg/m3), coarse aggregate (kg/m3), and 
fine aggregate (kg/m3) used to make the corresponding variety of concrete. 
Additionally, for each observation, we have 3 so-called output variables mea- 
suring attributes of the resulting concrete. These are slump (cm), flow (cm) 
and 28-day compressive strength (Mpa). 

This dataset actually contains two groups of observations collected over 
separate periods. The first 78 measurements pre-date the last 25 by several 
years. An interesting aspect of this dataset is that these two groups largely 
overlap on bivariate scatter-plots of the data, forming a seemingly homoge- 
neous group of observations. Appearances can be deceptive however: when 
considered along all variables jointly, the two groups are clearly distinct. For 
example, denoting Jq {Jn) the subset of 78 (25) members of the early (lat- 
ter) batch of measurements, we find that the closest member of Jjsf lies at 
a squared Mahalanobis distance of over 760 with respect to {to,So)- As a 
point of comparison, this is nearly 32 times larger than Xo99 lo- 

We first ran the SDE, FastMCD, FastMVE and FastPCS algorithms on 
the raw dataset (for the first three, we used the rrcov implementations with 
default parameters). We set the number of random directions in SDE and 
the number of random p-subsets in FastMCD, FastMVE and FastPCS to 
Mp = 2000, as given in equation (3.5). To have comparable results, we will 
use for each estimator dM,i{t*, 5"^,), the vector of statistical distances wrt H^,. 

The Mahalanobis outlyingness indexes are displayed in the first column 
of Figure 12, Concrete (i). Each row corresponds to an estimator. For each 
estimator we drew two boxplots. The first one (blue) depicts the outlyingness 
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Fig 12. Mahalanobis outlyingness index dM,i(i»,5,) for the four estimators and the 
four variants of the concrete dataset. In each panel, the blue (orange) boxplot depicts 
dM,i(t, , St) for the members of Jo (Jn)- 

index for the 78 members of Jq and the second (orange) boxplot for those 
25 members of Jn. All the algorithms are able to unambiguously separate 
the two subgroups of observations: the outlyingness values assigned to the 
members of the more recent batch of observations are notably larger than 
the outlyingness values assigned to any members of Jo- 
in a second experiment, Concrete(ii), we made the original outlier detec- 
tion problem harder by narrowing the distance separating the outliers from 
the genuine observations. To do so, we pulled the 25 outliers towards the 
mean of the good part of the data by replacing the original outliers with 25 
observations of the form ^ {xn+i + to)/'^ for 1 < i < 25, n = 78. We 
denote to be these 25 members of < i < 25. The second column 

of Figure 12 depicts the outlyingness indexes for this second experiment. 
Again, for each algorithm, the first (blue) boxplots depict the values of the 
outlyingness indexes for the 78 members of Jq and the second (orange) box- 
plot for the 25 members of J}^. Note that the members of J)J still form a 
cluster that is genuinely separate from the main group of observations. For 
example, the closest member of J)^* lies at a squared Mahalanobis distance of 
over 190 with respect to {to, So)- For comparison, this is over 8 times larger 
than Xo99 lo- Here, the FastPCS and SDE outlyingness indexes continue to 
clearly distinguish between the two groups. The outlyingness indexes derived 
from FastMVE and FastMCD however fail to adequately flag the outliers. 

In a third experiment, Concrete(iii), we made the original outlier de- 
tection problem harder still by increasing the contamination rate of the 
sample. This was done by adding an additional 25 outliers of the form 
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Xn!^25+k ^ {xn+i + Xn+j)/2 for I < i j,k < 25, n = 78. We denote 
J^* the members of this third set of 50 outliers. The third column of Fig- 
ure 12 depicts the outlyingness indexes for this experiment. Again for each 
algorithm, a blue boxplot pertains to the 78 members of Jq and an orange 
one to the 50 members of J)J*. For SDE, FastMVE and FastMCD we can see 
that increasing the contamination rate of the sample causes the outlyingness 
indexes of the two groups to overlap while the outlyingness index produced 
by FastPCS continues to make a clear distinction. Remarkably, the subsets 
of h observations with the smallest outlyingness chosen by the first three 
algorithms are primarily composed of outliers. 

In a final experiment, Concrete(iv), we narrowed the distance between the 
good data and the outliers (as in Concrete(ii)), and increased the contami- 
nation rate (as in Concrete(iii)). The fourth column of Figure 12 depicts the 
outlyingness indexes for this experiment, again with for each algorithm a 
blue boxplot for the 78 members of Jq and an orange for the 50 members of 
Jj^. Here again, the outlyingness index of SDE, FastMCD and FastMVE fails 
to adequately flag the outliers. However, as with each of the previous exper- 
iments, FastPCS still assigns a larger index of outlyingness to the members 
of the outlying group. 

Overall, we see that the results of the real data experiment confirm those 
of the simulations, at least qualitatively. When the contamination rates are 
small and the outliers well separated from the good part of the data, all 
outlier detection methods seem to perform equally well. As we consider more 
difficult outlier settings however, we see that the FastPCS outlyingness index 
is the only one that identifies the outliers reliably. 

5. Outlook. In this article we introduced PCS, a new outlyingness in- 
dex and FastPCS, a fast and affine equivariant algorithm for computing it. 
Like many other outlier detection algorithms, the performance of FastPCS 
hinges crucially on correctly identifying an /i-subset of uncontaminated ob- 
servations. Our main contribution is to characterize this /i-subset using a new 
measure of congruence of a multivariate cloud of points. This new charac- 
terization was designed to be insensitive to the configuration of the outliers. 

In our simulations, we have focused on configurations of outliers that are 
worst-case for affine-equivariant algorithms, and found that FastPCS behaves 
notably better than the other procedures we considered, often revealing out- 
liers that would not have been identified by the other approaches. In practice, 
admittedly, contamination patterns will not always be as difficult as those 
we considered above and in many cases the different methods will, hopefully, 
mostly concur. Nevertheless, given that in practice we do not know the con- 
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figuration of tlie outliers, as data analysts, we prefer to carry our inferences 
while planing for the worst contingencies. 

Also through simulations, we found that the performance of FastPCS is 
not affected much by the tail index of the majority of the data. For example, 
FastPCS is capable of finding the outliers both when the majority of the 
data is distributed multivariate normal, or is drawn from a heavy-tailed 
distribution, such as the multivariate Cauchy. 

In this article we emphasized the practical aspects of PCS. A number of 
theoretical properties deserve further investigation. In particular, we suspect 
that PCS has maximum breakdown point. Arguments supporting this con- 
jecture are that the maximum biases are low and that the procedure has the 
exact fit property. 
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